ABSTRACT


Computation  Techniques  for  Large  Scale 
Undiscounted  Markov  Decision 
Processes 


In  this  paper  we  consider  computation  techniques  associated  with  the 
optimization  of  large  scale  Markov  decision  processes.  Markov  decision 
processes  and  the  successive  approximation  procedure  of  White  are  described. 
Then  a procedure  for  scaling  continuous  time  and  renewal  processes  so  that 
they  are  amenable  to  the  White  procedure  is  discussed.  The  effect  of  the 
scale  factor  value  on  the  convergence  rate  of  the  procedure  and  insights 
into  proper  scale  factor  selection  are  given.  Finally,  various  methods  of 
achieving  computational  efficiency  during  execution  of  the  optimization  are 
considered. 
Introduction


One  of  the  most  powerful  modeling  tools  for  the  analysis  of  controlled 
probabilistic  systems  is  Markov  decision  processes.  If  the  system  can  be 
structured  as  a Markov  process  and  the  control  decisions  for  the  system  can 
be  defined  in  terms  of  the  relevant  system  costs  and  operational  character- 
istics (transition  probabilities) , then  there  exists  a wealth  of  theory  that 
can  be  used  to  find  the  best  (least  cost,  most  profitable)  set  of  decisions 
for  operating  the  system.  As  with  many  modeling  techniques,  real  probabilistic 
systems,  when  modelled  as  Markov  processes,  tend  to  have  large  numbers  of  system 
states.  The  result  is  that  for  many  interesting  and  important  systems,  the 
computational  aspects  are  overwhelming.  In  most  cases,  digital  simulation  is 
the  only  viable  alternative  modelling  tool.  However,  searching  for  "optimal" 
control  decisions  for  the  system  via  digital  simulation  is  at  best  a trial  and 
error  effort,  and  at  worst  a tedious,  expensive,  and  confusing  exercise  in 
experimental  design  and  response  surface  techniques. 

In many cases, the prospects of large scale optimization of Markov decision
processes as an alternative to digital simulation are quite good, if one is
willing to tackle the computational aspects. In this paper we first review
various  forms  of  non-discounted  Markov  decision  processes  and  transform  each 
to  the  form  of  a standard  finite  state  and  action  Markov  decision  process. 

This procedure was explicitly used by Schweitzer [20] for Markov renewal
programs and involves choosing a parameter, b, for the transformation. As noted
by Schweitzer, the value of b influences the asymptotic convergence rate when
White's iterative procedure [25] is used to solve the transformed Markov
decision process. We present theoretical insights into the determination of
a b which yields the fastest asymptotic convergence. In practice, one cannot
easily find this optimal b, so we also present heuristic rules for choosing
b. Computational results based upon the heuristics are given which appear
quite promising. Finally, for completeness, we briefly review several other
computational techniques used in solving large scale Markov decision processes.

Background  and  Problem  Transformations 

Consider a finite state, discrete time, completely ergodic Markov process
which is controlled by a decision maker. For each of the N states (i),
at each transition of the process, the decision maker chooses an action
$k = 1, \ldots, K_i$. This action results in transition probabilities
$p_{ij}^k$, $j = 1, \ldots, N$, and a reward (cost) $q_i^k$. $p_{ij}^k$ is
defined as the probability that the process, now in state i and under
policy k, will move to state j over the next transition of the process.
$q_i^k$ is defined as the expected reward (cost) over the next transition
for operating the system. The problem is to find the optimal action for each
state. Here optimality refers to the maximization (minimization) of the
expected reward (cost) rate for the process in steady state. This quantity is
referred to as g, the gain of the process.

Howard [7] showed that for a given policy set, the simultaneous set of
linear equations,

$$v_i + g = q_i + \sum_{j=1}^{N} p_{ij} v_j, \quad i = 1, \ldots, N \qquad (1)$$

$$v_N = 0$$

could be solved to compute the gain g of the process. The $v_i$'s are the relative
rewards (costs) of starting the process in state i. Howard showed that the
optimal gain for the process could be obtained using a simple policy iterative
algorithm (Figure 1).
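As a concrete illustration of equations (1), the following minimal sketch solves the policy evaluation system for one fixed policy by Gaussian elimination. The function name `evaluate_policy` and the two-state numbers are assumptions made for illustration only; they do not come from the paper, and the sketch assumes a completely ergodic chain so that the system is nonsingular.

```python
def evaluate_policy(P, q):
    """Solve v_i + g = q_i + sum_j P[i][j] * v_j with v_N = 0 for (g, v).

    P is the N x N transition matrix of one fixed policy and q its reward
    vector. Hypothetical helper: assumes the chain is completely ergodic.
    """
    n = len(q)
    # Unknowns are (v_1, ..., v_{N-1}, g); v_N is fixed at zero.
    rows = []
    for i in range(n):
        row = [(1.0 if i == j else 0.0) - P[i][j] for j in range(n - 1)]
        row.append(1.0)   # coefficient of g
        row.append(q[i])  # right-hand side
        rows.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    x = [rows[i][n] / rows[i][i] for i in range(n)]
    return x[n - 1], x[: n - 1] + [0.0]


# Illustrative two-state policy (numbers are not from the paper).
P_example = [[0.5, 0.5], [0.4, 0.6]]
q_example = [6.0, -3.0]

if __name__ == "__main__":
    g, v = evaluate_policy(P_example, q_example)
    print("gain:", g, "relative values:", v)
```

The direct solve costs on the order of N^3 operations per policy evaluation, which is the expense that motivates the successive approximation approach discussed later.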
It should be noted that the Howard algorithm is essentially a dual approach
to the linear programming approaches developed by Wolfe and Dantzig [26],
Derman [1], Manne [12], and Fox [2] (for semi-Markov decision processes).
Computationally, the Howard approach is much more efficient than the L.P.
approach. As a consequence, the L.P. approach is not considered here.

We now briefly turn our attention to the continuous time Markov decision
process and the semi-Markov decision process (which itself subsumes both the
continuous and discrete time models as special cases). However, ultimately we
will reduce these latter two cases to the non-discounted Markov decision
process and so this diversion is provided only for completeness. Consider a
finite state, continuous time, completely ergodic Markov process which is
controlled by a decision maker. For each of the N states (i), at each transition
of the process, the decision maker chooses an action $k = 1, \ldots, K_i$. This
action results in a transition rate $a_{ij}^k$ and a reward (cost) rate $q_i^k$.
$a_{ij}^k$ is defined as follows: In an increment of time dt, the process, now in
state i and under policy k, will move to state j with probability
$a_{ij}^k\,dt$ $(i \neq j)$. The probability of two or more state transitions is of
the order $dt^2$ or higher and is assumed to be zero if dt is taken sufficiently
small. $q_i^k$ is the expected reward (cost) rate incurred over a residence in
state i using action k.


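Howard [7] showed that for a given policy set, the simultaneous set of
linear equations

$$g = q_i + \sum_{j=1}^{N} a_{ij} v_j, \quad i = 1, \ldots, N \qquad (2)$$

$$v_N = 0$$

where

$$a_{ii} = -\sum_{j \neq i} a_{ij}, \quad i = 1, \ldots, N \qquad (3)$$

could be solved to compute the gain g of the process.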

Howard also showed that the optimal gain for the process could be obtained using
a simple policy iterative algorithm. The algorithm is the same as that given in
Figure 1, except that equations (2) must be substituted for Step C, and Step B
must be appropriately modified.


Finally, consider the semi-Markov decision process. This is essentially the
same as the discrete time decision process described earlier in that there is
an underlying Markov process with transition probabilities $p_{ij}$. However,
the holding (transition) time (m) in going from state i to j is described by
the density function $h_{ij}(m)$, $0 < m < \infty$. The expected holding
(transition) time, given the system starts in state i, is

$$T_i = \sum_{j=1}^{N} p_{ij} \int_0^{\infty} m\, h_{ij}(m)\, dm > 0 .$$


Jewell [8] showed under rather general assumptions that for a given policy
set, the simultaneous set of linear equations

$$v_i + g T_i = q_i + \sum_{j=1}^{N} p_{ij} v_j, \quad i = 1, \ldots, N \qquad (4)$$

$$v_N = 0$$


could be solved to compute the gain g of the process. Jewell also showed that
the optimal gain for the process could be obtained using a
modified version of Howard's simple policy iterative algorithm by substituting
equations (4) for Step C and

$$\frac{q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j - v_i}{T_i^k} \qquad (5)$$

for the test function in Step B of Figure 1.
White's Method and Problem Transformations


It is easy to see from Figure 1 that, for each of the processes described,
the bulk of the computational effort in the algorithm lies in solving the
simultaneous set of linear equations (Step C). For large processes,
straightforward techniques, such as Gaussian elimination, quickly become untenable.

In an elegant paper, White [25] proposed a successive approximation approach
for the undiscounted, discrete time, Markov decision process. Odoni [16] added
bounds for g which are useful in termination decisions. We can also relax
the complete ergodicity requirements in line with those given in footnote 1.

The White-Odoni technique can be summarized as follows:

Assume we have computed sets of values $V_i(n-1)$, $v_i(n-1)$, $i = 1, \ldots, N$, and
a quantity $g_{n-1}$; then compute

$$V_i(n) = \max_k \left[ q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j(n-1) \right], \quad i = 1, \ldots, N,$$

$$g_n = V_M(n),$$

$$v_i(n) = V_i(n) - g_n, \quad i = 1, \ldots, N, \qquad (6)$$

$$L''(n) = \max_i \{ V_i(n) - v_i(n-1) \},$$

$$L'(n) = \min_i \{ V_i(n) - v_i(n-1) \},$$

where M is a state of the process such that for all sets of policies and some
integer u > 0, the probability of reaching state M in u transitions, starting
in any state i, is nonzero for all states i. White showed that the repeated
application of equations (6) will converge. In other words,

$$\lim_{n \to \infty} v_i(n) = v_i \quad \text{and} \quad \lim_{n \to \infty} g_n = g,$$

where $v_i$ and g are as defined for equations (1). Odoni showed that

$$L''(n) \geq L''(n+1) \geq g \geq L'(n+1) \geq L'(n).$$

In practice, White's algorithm has proven to be very effective for large
scale systems. The iterative procedure is stable and self-correcting and, since
no new data are created (except for the working vectors), storage requirements
are fixed. Along these lines, it pays to take advantage of any supersparsity
[9] (the vast majority of large scale processes do have very sparse transition
matrices) so that the procedure can take place entirely in-core.
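The White-Odoni recursion can be sketched in a few lines. The version below is a minimal illustration, not the authors' implementation: the function name `white_odoni`, the nested-list data layout, the choice of the last state as the reference state M, and the small two-state problem are all assumptions introduced here. It stops when the Odoni bounds differ by no more than a tolerance.

```python
def white_odoni(P, q, tol=1e-4, max_iter=10_000):
    """Successive approximation for an undiscounted, discrete time MDP.

    P[i][k][j] : probability of moving from state i to j under action k
    q[i][k]    : expected one-transition reward in state i under action k
    Returns (g, v, policy, (L_lower, L_upper), iterations).
    """
    n_states = len(q)
    M = n_states - 1            # reference state; assumed reachable from every state
    v = [0.0] * n_states        # starting values v_i(0)
    for n in range(1, max_iter + 1):
        V, policy = [], []
        for i in range(n_states):
            values = [q[i][k] + sum(p * vj for p, vj in zip(P[i][k], v))
                      for k in range(len(q[i]))]
            best = max(range(len(values)), key=values.__getitem__)
            policy.append(best)
            V.append(values[best])
        diffs = [V[i] - v[i] for i in range(n_states)]
        upper, lower = max(diffs), min(diffs)   # L''(n) and L'(n)
        g = V[M]                                # g_n = V_M(n)
        v = [Vi - g for Vi in V]                # v_i(n) = V_i(n) - g_n
        if upper - lower <= tol:
            return g, v, policy, (lower, upper), n
    raise RuntimeError("no convergence within max_iter iterations")


# Illustrative two-state, two-action problem (numbers are not from the paper).
P = [[[0.5, 0.5], [0.8, 0.2]],
     [[0.4, 0.6], [0.7, 0.3]]]
q = [[6.0, 4.0], [-3.0, -5.0]]

if __name__ == "__main__":
    g, v, policy, (lower, upper), n = white_odoni(P, q)
    print(f"gain ~ {g:.4f} in [{lower:.4f}, {upper:.4f}], policy {policy}, {n} iterations")
```

Only the current and previous value vectors are held, which reflects the fixed storage requirement noted above; a sparse representation of `P` would replace the dense inner sum for large problems.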

While straightforward application of White's approach does not, in general,
work for continuous time and semi-Markov processes, these processes can be
transformed to a form compatible with White's approach. Consider equations (2)
with $v_i$ added to both sides of the equation.

$$v_i + g = q_i + \sum_{j \neq i} a_{ij} v_j + (1 + a_{ii}) v_i, \quad i = 1, \ldots, N \qquad (7)$$

$$v_N = 0$$

Noting the definition of $a_{ii}$ (equation (3)), then if

$$0 \geq a_{ii} \geq -1, \quad i = 1, \ldots, N, \qquad (8)$$

equation (7) is of the same form as equation (1). This is easily seen by noting
that if (8) holds, then

$$\sum_{j \neq i} a_{ij} + (1 + a_{ii}) = 1, \quad i = 1, \ldots, N,$$

$$a_{ij} \geq 0, \quad i \neq j, \quad \text{and}$$

$$1 + a_{ii} \geq 0, \quad i = 1, \ldots, N.$$

Substituting $1 + a_{ii}$ for $a_{ii}$ in the rate matrix, it is seen that the new matrix
$\{a_{ij}\}$, for each action set, has all the properties of a stochastic matrix.

As a consequence, if (8) holds, it follows that White's method can indeed
be used to solve the continuous time Markov decision process. The procedure
in Figure 2 will convert a suitable continuous time problem so that the
conditions of (8) will hold.

1. Let $a_{\max} = \max_{i,k} |a_{ii}^k|$, $i = 1, \ldots, N$, $k = 1, \ldots, K_i$.
   (Note: $a_{\max} > 0$.)

2. Divide all $a_{ij}^k$ and $q_i^k$, $i, j = 1, \ldots, N$, $k = 1, \ldots, K_i$,
   by $b \geq a_{\max}$. Condition (8) is now satisfied.

3. Using the new $a_{ij}^k$ and $q_i^k$, solve the problem using White's method.

4. To express the results in terms of the original continuous process,
   multiply the gain g by b. The optimal policy and relative rewards
   (costs), $v_i$, obtained are valid for the original process.

Figure 2

Scaling Procedure
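The scaling steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `scale_continuous_problem` and `unscale_gain` and the nested-list layout are hypothetical, and the function returns the stochastic matrices obtained after substituting $1 + a_{ii}$ on the diagonal, ready for a successive approximation routine.

```python
def scale_continuous_problem(A, q, b=None):
    """Scale a continuous time problem so that condition (8) holds.

    A[i][k][j] : transition rate from state i to j under action k
                 (each row sums to zero, diagonal entries non-positive)
    q[i][k]    : reward rate in state i under action k
    b          : scale factor; defaults to a_max when omitted
    Returns (P, q_scaled, b), with P[i][k] a row of a stochastic matrix.
    """
    n = len(A)
    # Step 1: largest absolute diagonal rate over all states and actions.
    a_max = max(abs(A[i][k][i]) for i in range(n) for k in range(len(A[i])))
    if b is None:
        b = a_max
    if b < a_max:
        raise ValueError("b must be at least a_max")
    # Step 2: divide rates and rewards by b, then add the identity.
    P = [[[(1.0 if j == i else 0.0) + A[i][k][j] / b for j in range(n)]
          for k in range(len(A[i]))] for i in range(n)]
    q_scaled = [[q[i][k] / b for k in range(len(q[i]))] for i in range(n)]
    return P, q_scaled, b


def unscale_gain(g_scaled, b):
    """Step 4: express the gain in the time units of the original process."""
    return g_scaled * b


if __name__ == "__main__":
    rates = [[[-0.3, 0.3]], [[0.5, -0.5]]]   # one action per state
    rewards = [[1.0], [2.0]]                  # illustrative reward rates
    P, q_scaled, b = scale_continuous_problem(rates, rewards)
    print(P, q_scaled, b)
```

Passing a larger `b` than the default leaves more probability mass on the diagonal of each row, which is the degree of freedom examined in the discussion of convergence.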


Note that the scaling of the problem really amounts to changing the time
frame of the problem. For instance, if the process is stated in terms of
transitions per minute ($a_{ij}^k$) and dollars per minute ($q_i^k$), and
$a_{\max} = 1/60$, then the transformation
simply converts the time frame to hour units. It is readily seen that it is

necessary to divide by at least $a_{\max}$ ($a_{\max} > 1$) to end with a stochastic
$\{a_{ij}\}$ matrix. The question of interest is: Can the convergence rate of
White's method be improved by using a proper choice of $b \geq a_{\max}$? We
consider that question shortly; but first, let us address the semi-Markov
decision process.

Consider equations (4) with the relative reward (cost) $v_i$ moved to the right
hand side of the equation and both sides divided by the expected holding
(transition) time $T_i$.

$$g = \frac{q_i}{T_i} + \sum_{j \neq i} \frac{p_{ij}}{T_i} v_j + \frac{(p_{ii} - 1)}{T_i} v_i, \quad i = 1, \ldots, N \qquad (9)$$

$$v_N = 0$$

Letting the reward rate be $q_i^k / T_i^k$,

$$a_{ij}^k = p_{ij}^k / T_i^k \quad (i \neq j), \quad \text{and}$$

$$a_{ii}^k = (p_{ii}^k - 1) / T_i^k,$$

it is readily seen that equations (9) are of the same form as equations (2), the
continuous time Markov decision process. As a consequence, the transformation
can also be applied to equations (9) to facilitate solution of the semi-Markov
decision process. It should be noted that the transformation is equivalent to
one developed by Schweitzer [20].
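The substitution described above is mechanical and can be sketched directly. The helper name `semi_markov_to_rates` and the sample numbers are assumptions for illustration; the output has the form of a continuous time problem, so the scaling procedure of Figure 2 could then be applied to it.

```python
def semi_markov_to_rates(p, q, T):
    """Rewrite a semi-Markov decision problem in continuous time form.

    p[i][k][j] : embedded transition probability from i to j under action k
    q[i][k]    : expected reward over one transition
    T[i][k]    : expected holding (transition) time, assumed > 0
    Returns (A, q_rate) with A[i][k][j] = (p[i][k][j] - delta_ij) / T[i][k].
    """
    n = len(p)
    A = [[[(p[i][k][j] - (1.0 if j == i else 0.0)) / T[i][k] for j in range(n)]
          for k in range(len(p[i]))] for i in range(n)]
    q_rate = [[q[i][k] / T[i][k] for k in range(len(q[i]))] for i in range(n)]
    return A, q_rate


if __name__ == "__main__":
    # Illustrative two-state problem with one action per state.
    p = [[[0.5, 0.5]], [[0.4, 0.6]]]
    q = [[6.0], [-3.0]]
    T = [[2.0], [4.0]]
    A, q_rate = semi_markov_to_rates(p, q, T)
    print(A, q_rate)
```

Each row of the resulting rate matrix sums to zero because each row of `p` sums to one, so the definition of the diagonal rate in equation (3) is satisfied automatically.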

We  now  turn  our  attention  to  the  problem  of  speeding  the  convergence  of 
White's  algorithm. 
Convergence Facilitation

There are several procedures that have been used in accelerating convergence
in solving discounted Markov decision processes. By and large, though, these
have not been examined extensively in the non-discounted Markov decision
process context. Briefly, the acceleration techniques include (a) problem
transformation, (b) cheap iterations, (c) suboptimal activity elimination, and
(d) extrapolation procedures. We will discuss each of these, in turn, in the
context of the non-discounted Markov decision process.

(a)  Problem  Transformation 

In solving (generalized) discounted Markov decision processes, it is
well known that the largest spectral radius of the transition matrices (i.e.,
the process spectral radius) governs the asymptotic convergence rate. Porteus
[18], Totten [24] and others have devised problem transformations to reduce
the process spectral radius. Morton and Wecker [14] have shown that asymptotic
relative values and policy convergence are at least of order $(\alpha\lambda)^n$
where $\lambda$ is greater than the subdominant eigenvalue (see footnote 2) and
$0 < \alpha \leq 1$ is the discount factor. A reasonable question to
ask is whether the choice of b in Step 2 (Figure 2) can be made to reduce the
modulus of the subdominant eigenvalue of the transition matrix of the optimal
policy.

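The transition matrix for policy $\delta$ resulting from the procedure of Figure 2
is

$$I + \frac{1}{b} A_\delta,$$

where $A_\delta = \{a_{ij}^\delta\}$ is the rate matrix for policy $\delta$.
Let $\lambda$ and x be an eigenvalue and associated eigenvector, respectively, of the
starting transition matrix $I + \frac{1}{a_{\max}} A_\delta$. Then

$$\frac{a_{\max}}{b} \lambda + \frac{b - a_{\max}}{b} \qquad (10)$$

is an eigenvalue of $I + \frac{1}{b} A_\delta$ with x its associated eigenvector. Now clearly

$$\frac{a_{\max}}{b} \operatorname{re}\lambda + \frac{b - a_{\max}}{b} \geq \operatorname{re}\lambda,$$

where $\operatorname{re}\lambda$ is the real part of $\lambda$ with
$-1 \leq \operatorname{re}\lambda \leq 1$ and $b \geq a_{\max} > 0$. However, it may
not be true that

$$\left| \frac{a_{\max}}{b} \lambda + \frac{b - a_{\max}}{b} \right| \leq |\lambda|. \qquad (11)$$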
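Expression (10) can be checked directly from the eigenvalue relation. The short derivation below uses only the symbols already defined and is offered as a verification step, not as part of the original argument.

```latex
\begin{align*}
\Bigl(I + \tfrac{1}{a_{\max}} A_\delta\Bigr) x = \lambda x
  &\;\Longrightarrow\; A_\delta x = a_{\max}(\lambda - 1)\, x \\
  &\;\Longrightarrow\; \Bigl(I + \tfrac{1}{b} A_\delta\Bigr) x
     = \Bigl[\,1 + \tfrac{a_{\max}}{b}(\lambda - 1)\Bigr] x
     = \Bigl[\tfrac{a_{\max}}{b}\,\lambda + \tfrac{b - a_{\max}}{b}\Bigr] x .
\end{align*}
```

The eigenvalue 1 is mapped to itself for every b, so only the remaining eigenvalues move, each along the line segment joining it to 1.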


Suppose $\delta$ indexes an optimal policy and $\lambda$ is a subdominant eigenvalue
associated with this policy. Expanding the square of the modulus of both sides of
(11) with $\lambda = \lambda_1 + \lambda_2 i$ gives that a reduction in the modulus
of $\lambda$ requires

$$(b - a_{\max}) + 2 a_{\max} \lambda_1 < (a_{\max} + b)(\lambda_1^2 + \lambda_2^2).$$

If $\lambda_2 = 0$, then either $\lambda_1 = 1$ and no reduction can be made or
$\lambda_1 < (a_{\max} - b)/(a_{\max} + b)$ and $\lambda_1$ is necessarily negative.
In this case, it would appear that any $b > a_{\max}$ will yield a resultant benefit
in asymptotic convergence. However, this is not necessarily true, since we may
"bump" into another eigenvalue. That is, increasing b to decrease the absolute
value of the subdominant (negative) eigenvalue will eventually result in some
other (positive) eigenvalue increasing until it becomes the new subdominant
eigenvalue. At that point further increases in b will not improve the
convergence rate.

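The following example illustrates an extreme improvement from problem
transformation. Let

$$A_\delta = \begin{pmatrix} -.3 & .3 \\ .5 & -.5 \end{pmatrix},$$

so that $a_{\max} = .5$ and the starting transition matrix
$I + \frac{1}{a_{\max}} A_\delta$ has a spectrum of $\{1, -.6\}$. For $b \geq .5$ we
have a spectrum $\{1, \frac{b - .8}{b}\}$. Here we want $b = .8$ which gives a
modulus of 0.0 for the subdominant eigenvalue of the transformed process.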
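The two-state example can be reproduced numerically with the eigenvalue map of equation (10). The function name `transformed_eigenvalue` is an assumption introduced for this sketch; the values -.6, .5 and .8 are those of the example.

```python
def transformed_eigenvalue(lam, a_max, b):
    """Image of an eigenvalue of I + A/a_max under the change of scale to b.

    Implements equation (10): (a_max / b) * lam + (b - a_max) / b.
    lam may be real or complex.
    """
    if not (b >= a_max > 0):
        raise ValueError("requires b >= a_max > 0")
    return (a_max / b) * lam + (b - a_max) / b


if __name__ == "__main__":
    # Subdominant eigenvalue -.6 of the starting matrix, with a_max = .5.
    for b in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
        modulus = abs(transformed_eigenvalue(-0.6, 0.5, b))
        print(f"b = {b:.1f}  subdominant modulus = {modulus:.4f}")
```

The modulus falls to zero at b = .8 and then rises again as the eigenvalue becomes positive, which is the behaviour that makes an interior choice of b preferable to an arbitrarily large one.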

As a further example, consider the Markov process whose transition matrix
is given as follows:

$$\begin{pmatrix}
.31 & .13 & .21 & .05 & .10 & .20 \\
.15 & .12 & .16 & .20 & .12 & .25 \\
.02 & .01 & .01 & .01 & .93 & .02 \\
.12 & .28 & .09 & .16 & .04 & .31 \\
0 & .01 & .85 & 0 & .09 & .05 \\
.11 & .30 & .10 & .15 & .14 & .20
\end{pmatrix}$$
The eigenvalues are 1.0, -.8421, .6945, .2079, -.085 + .0116i, and
-.085 - .0116i. It would appear that problem transformation should be of value
in speeding convergence, since the subdominant eigenvalue is negative. From
the preceding development, it would be expected that the convergence rate of
the process would be maximized at the value of "b" which results in the largest
negative eigenvalue being equal to the largest positive eigenvalue. Applying
equation (10), to equate the two eigenvalues of the transformed matrix, we get

$$\frac{a_{\max}}{b}(.8421) - \frac{b - a_{\max}}{b} = \frac{a_{\max}}{b}(.6945) + \frac{b - a_{\max}}{b}.$$

Solving, we get b = 1.063. In other words, transforming the process using
b = 1.063 should achieve the "best" asymptotic convergence for the process.
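The balancing condition can be solved in closed form. Writing the images of a negative eigenvalue and a positive eigenvalue under equation (10) and requiring equal moduli gives $b = a_{\max}(2 - \lambda_{neg} - \lambda_{pos})/2$. The sketch below evaluates this; the function name `balancing_b` is hypothetical, and the value $a_{\max} = .99$ is an assumption inferred here from the smallest diagonal entry (.01) of the transition matrix, since the text does not state it. Under that assumption the formula gives a value close to the reported b = 1.063.

```python
def balancing_b(lam_neg, lam_pos, a_max):
    """Scale factor at which two real eigenvalues map to equal moduli.

    lam_neg < 0 and lam_pos > 0 are eigenvalues of the starting matrix.
    Solves -(c*lam_neg + 1 - c) = c*lam_pos + 1 - c for c = a_max / b.
    """
    return a_max * (2.0 - lam_neg - lam_pos) / 2.0


def transformed(lam, a_max, b):
    """Equation (10) applied to a single eigenvalue."""
    return (a_max / b) * lam + (b - a_max) / b


if __name__ == "__main__":
    a_max = 0.99  # assumption: 1 minus the smallest diagonal entry
    b = balancing_b(-0.8421, 0.6945, a_max)
    print(f"b = {b:.4f}")
    print(transformed(-0.8421, a_max, b), transformed(0.6945, a_max, b))
```

Only the two extreme real eigenvalues enter this calculation, so the result is a first-order prediction and need not coincide with the b that minimizes the observed iteration count.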

As a test, White's algorithm was run using costs of

q = (1.14, 2.27, 5.06, 2.97, 3.96, 4.90)

(only one policy per state). The problem was declared "solved" when
$L''(n) - L'(n) \leq 10^{-4}$. Runs were made for various values of b (see Figure 3).
The actual minimum number of iterations (30) occurred for a value of b = 1.09,
whereas the number of iterations for b = 1.063 was slightly higher (31). The
inaccuracy in prediction is expected, since the method of prediction considers
only main effects and ignores the contribution of the smaller eigenvalues.

As one might expect, the straightforward application of the above
observations is not practical, since the determination of eigenvalues for
large processes is itself difficult. However, in practice it is usually
intuitively obvious to the analyst that a process may possess strong cyclic
tendencies, indicating that some eigenvalue has a large negative real
component. If the cyclic tendency is strong enough, this eigenvalue will be
the subdominant eigenvalue and the above development suggests that some
$b > a_{\max}$ may improve the resulting asymptotic convergence rate. In any event,
applying White's method, using several values of b marginally larger than
$a_{\max}$, and noting the convergence rate of the process for various values of b
can many times be of value.

In testing the above we noted that if b was made successively slightly
larger than $a_{\max}$, either the convergence improved dramatically or the
convergence slightly deteriorated. To further test this observation, we
randomly generated Markov decision problems with a varied number of states.
Within each state ten different actions were available. White's method was
used to solve each using b values of

$b_0 = a_{\max}$, $1.05\,b_0$, $1.10\,b_0$, and $1.15\,b_0$.

Again, problems were declared "solved" at iteration n when
$L''(n) - L'(n) \leq 10^{-4}$. If a problem was solved in fewer iterations for some
$b_i$ than $b_j$ with $i > j$, then the problem transformation was declared
beneficial. Otherwise the transformation was classified as non-beneficial.
Clearly a problem could be mislabelled as non-beneficial using the grid given
above but may in fact be beneficial for some $b > a_{\max}$. The opposite is not
the case.

Table 1: Total Iteration Counts (by number of states)

Table 1 gives the total number of iterations to solve the non-beneficial
and beneficial problem cases. $N_n$ and $N_b$ stand for the number of problems
labelled non-beneficial and beneficial, respectively. For example, in the
randomly generated 8 state problems, 7 problems were labelled non-beneficial
and 8 were labelled beneficial. Table 2 summarizes Table 1 by providing totals
across states and then averages across states and over problems.

If we can assume that the average performance of the set of randomly
generated problems used in this study is representative of the performance
of the set of real world problems, then the following observations can be
made. First, problems whose convergence can be improved by increases in b
above $a_{\max}$ are those problems that are hard to solve anyway (see Table 2,
19.5 versus 35.4 iterations). Second, when a problem does not show convergence
improvement when b is increased above $a_{\max}$, the deterioration in convergence
speed is not dramatic (see Table 2, 19.5 versus 22.8 iterations, for a
15% increase in b above $a_{\max}$). Finally, convergence improvements, when they
occur, are rather dramatic (see Table 2, 35.4 to 18.5 iterations for a 15%
change in b above $a_{\max}$). These observations suggest that problem
transformation can be of significant value in speeding convergence.

(b)  Cheap  Iterations 

Cheap iterations were first noted by Morton [13] and discussed in detail
by Zaldivar and Hodgson [26]. "Cheap iterations" are accomplished simply by
not performing policy maximization at every iteration of White's method. If
one does not perform a policy maximization the computational effort per
iteration is reduced considerably. This approach makes sense intuitively in that
there are both policy sets, and relative values ($v_i$) and gain (g) converging
in the operation of White's method. Using cheap iterations allows the
relative values ($v_i$) and gain (g) to converge sufficiently so that a new policy
set can be chosen which is significantly better than the old one. Clearly,
in practice, there is an optimum tradeoff between "cheap" and "expensive"
iterations. In our experience, we have used from 5 to 30 cheap iterations
per policy maximization, the number being dependent on the convergence
properties of the process.
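A minimal sketch of the cheap iteration idea follows. It is an illustration under stated assumptions, not the authors' code: the function name, the parameter `cheap_per_max` (the number of cheap iterations between policy maximizations), the data layout and the two-state problem are all introduced here. Termination is tested only on expensive iterations, since bounds computed from a relaxed iteration are not valid.

```python
def white_with_cheap_iterations(P, q, cheap_per_max=5, tol=1e-4, max_iter=10_000):
    """White's method with policy maximization only every few iterations.

    P[i][k][j], q[i][k] as in an ordinary discrete time MDP.
    Returns (g, v, policy, iterations).
    """
    n_states = len(q)
    M = n_states - 1               # reference state (assumed suitable)
    v = [0.0] * n_states
    policy = [0] * n_states
    for it in range(max_iter):
        expensive = it % (cheap_per_max + 1) == 0
        V = []
        for i in range(n_states):
            if expensive:
                # Expensive iteration: maximize over all actions of state i.
                values = [q[i][k] + sum(p * vj for p, vj in zip(P[i][k], v))
                          for k in range(len(q[i]))]
                policy[i] = max(range(len(values)), key=values.__getitem__)
                V.append(values[policy[i]])
            else:
                # Cheap iteration: keep the current action for state i.
                k = policy[i]
                V.append(q[i][k] + sum(p * vj for p, vj in zip(P[i][k], v)))
        diffs = [V[i] - v[i] for i in range(n_states)]
        g = V[M]
        v = [Vi - g for Vi in V]
        # The gap between the bounds is trusted only on expensive iterations.
        if expensive and max(diffs) - min(diffs) <= tol:
            return g, v, policy, it + 1
    raise RuntimeError("no convergence within max_iter iterations")


# Illustrative two-state, two-action problem (numbers are not from the paper).
P_demo = [[[0.5, 0.5], [0.8, 0.2]],
          [[0.4, 0.6], [0.7, 0.3]]]
q_demo = [[6.0, 4.0], [-3.0, -5.0]]

if __name__ == "__main__":
    print(white_with_cheap_iterations(P_demo, q_demo))
```

With ten actions per state, a cheap iteration in this sketch does roughly one tenth of the arithmetic of an expensive one, which is why a ratio of several cheap iterations per maximization can pay off.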


(c)  Suboptimal  Activity  Elimination 

An extremely useful procedure in dynamic programming methodology is the
reduction of the policy space by determining actions that could never be part
of an optimal policy. These actions can be eliminated from further consideration.
Hence, the problem can actually shrink in size as the computations progress. The
idea of eliminating suboptimal activities was first given by MacQueen [11] in
the discounted Markov decision context and refined by others [4, 5, 18, 24].

The basic idea, cast in the non-discounted Markov decision process context,
is to first determine bounds on $(w, g)$ at iteration n (call them $l^n$ and
$u^n$) and then test, for each activity k associated with state i, whether the
system

$$g + w_i = q_i^k + \sum_{j=1}^{N} p_{ij}^k w_j$$

$$l^n \leq (w, g) \leq u^n$$

has a solution. If not, k cannot be part of an optimal policy and can be removed
from further consideration.

For the discounted Markov decision process, several researchers have
presented bounds [10, 18, 19, 24]. However, even though $w^n \to w$ and
$g^n \to g$, no bounds (see footnote 3) have been given for the non-discounted case.
Recently Hastings [3] has proposed a suboptimality test. His test
identifies non-optimal actions for state i at value iteration stage n. This does
not mean that the detected "non-optimal" actions are non-optimal for subsequent
stages (see footnote 4). Thus, actions flagged at iteration n must be
re-examined at some later stage. Hastings' test can be thought of as an
"intelligent" intermediate to expensive and cheap iterations.

We might note here that any type of relaxed iteration (cheap or Hastings')
will invalidate the bounds given by Odoni. That is, to be valid bounds,
$L''(n)$ and $L'(n)$ can only be determined from the unrelaxed iteration (i.e., the
expensive iteration).

(d)  Extrapolation  Methods 

Generally, the convergence of the relative values ($v_i$) to their respective
values takes place in an orderly fashion so that it is possible to make educated
guesses at the final values of the $v_i$'s, thereby speeding convergence of the
algorithm. Simple approaches such as linearly extrapolating each of the trends
of the progressions of the $v_i$'s seem to be most effective. For a more complete
discussion, see [27].

FOOTNOTES 


1.  The  assumptions  used  by  White  can  be  relaxed.  Schweitzer  [22]  proved 
convergence  for  the  general  single  chain  acyclic  process  while  Su  and 
Deininger  [23]  extended  this  to  the  periodic  case.  Such  conditions  are 
hard  to  test  in  practice.  Recently  Platzman  [17]  has  given  a weaker 
condition  that  can  be  readily  tested.  Finally,  Morton  and  Wecker  [14] 
have  generalized  most  of  the  above  plus  have  added  some  new  dimensions 
to  the  algorithm. 

2.  The  largest  eigenvalue  is  always  1.0.  The  subdominant  eigenvalue  is  the 
remaining  eigenvalue  having  the  largest  modulus. 

3. We warn the reader that the "bounds"

$$u_i^n = V_i(n) + L''(n) - L'(n)$$

$$l_i^n = V_i(n) - L''(n) + L'(n)$$

do not (in general) satisfy

$$l_i^n \leq V_i(m) \leq u_i^n, \quad m \geq n.$$

All that can be shown for these bounds is that

$$l_i^n \leq V_i(m) \leq u_i^n, \quad n-1 \leq m \leq n+1.$$

4. Under fairly mild conditions, Hastings [3] shows that there is a stage
after which non-optimal actions will be properly identified.